Introduction

This specific hurricane dataset was taken from Kaggle, which in turn was taken from National Hurricane Center (NHC). This dataset includes all tropical cyclones (tropical depression, tropical storm, hurricane, major hurricane) dated all the way back from 1851 to 2015. The most recent 2016 and 2017 Atlantic hurricane seasons weren’t included in the data.

With time series data, there are a lot of questions we could ask. We haven’t done much time series analysis in either EDA or in Stats, so we didn’t have too much technical expertise to do a lot of statistical analysis with the data. We decided on focusing purely on the visualization of the hurricanes, and more specifically, ask the questions:

1) Where do hurricanes form the most throughout the Atlantic ocean? Does this change throughout time?

2) Can we predict where a tropical cyclone could end up based on the origin of the system?

3) Which hurricanes shared the most similar paths? Can we find patterns or trends on those hurricanes?

For some of the questions, we created an R Shiny application which would allow a user to enter their desired system based on year, and find all the relevant visualizations associated with it. For the other questions, we created some very interesting plots that reveal a lot about these magnificent storms.

Cleaning

First, we imported and cleaned the data. This was crucial since this is time series data and there were things we needed to change specifically with the dates that would let R read them properly. We used the ‘lubridate’ package that made this process fairly easy.

hur <- fread(file.path("C:/Users/susmani/Documents/Year 5/Courses/Data Munging/GroupProject2/hurricane.csv"),na.strings = c("PrivacySuppressed", "NULL"))
hur<-data.frame(hur)
hur$Date<-ymd(hur$Date)
hur$Latitude<-as.numeric(unlist(strsplit(hur$Latitude, split='N', fixed=TRUE)))
hur$Longitude<-as.numeric(unlist(strsplit(hur$Longitude, split='W', fixed=TRUE)))
hur$Longitude<-(-1)*hur$Longitude

In the above code, we read in the file, converted it into a data frame that would easily be read. We then used the lubridate package to change the dates and times in this set to match the format that R likes. Lastly, and probably most important for graphing geographical maps of frequency and tracks of hurricanes, we needed to change the Latitude and Longitude points into the format which ggplot2 and ggmap enjoys.

The cleaning, as you can see, is very simple, but it allows us to create some truely mesmerizing visualizations.

The structure of our clearned dataset has over 50,000 single observations of system data throughout time, with over 1,000 unique cases. There are 22 variables in this data set, but we only focus on a few: ID, Year, Date, Time, Latitude, Longitude, Minimum Pressure, Maximum Pressure.

With this knowledge, we can continue to explore - and munge.

Basic Visualizations

First, we wanted to see what some tracks we could look at. To create some gorgeous and interactive plots, we used a combination of mapbox, an online mapping tool, and plotly, an interactive plotting tool in R. To use mapbox with plotly, we need to set a system token that links to Saad’s mapbox account:

Sys.setenv('MAPBOX_TOKEN' = 'pk.eyJ1Ijoic3VzbWFuaSIsImEiOiJjamFhZnJlb3YwczdmMzJxaXlmcHJ0ZGZ2In0.vdCC--CzL7cM-XsG8yCrFw')

Then, we can focus on one specific hurricane, like Hurricane Katrina from 2005, and plot its track:

p <- plot_mapbox(mode = 'scattermapbox') %>%
  add_markers(
    data = hur[hur$Date >= "2005-01-01" & hur$Name=='KATRINA',], x = ~Longitude, y = ~Latitude, text=~paste('Name: ', Name, '<br>Max Wind:', Maximum.Wind, '<br>Date: ', Date), split=~Name,
    size = ~Maximum.Wind, alpha = 0.5, color=~Date) %>%
  layout(
    plot_bgcolor = '#191A1A', paper_bgcolor = '#191A1A',
    mapbox = list(style = 'dark',
                  zoom = 3,
                  center = list(lat = median(hur$Latitude),
                                lon = median(hur$Longitude))),
    margin = list(l = 0, r = 0,
                  b = 0, t = 0,
                  pad = 0),
    showlegend=FALSE)
p

Looks pretty good! As maximum wind increases over time, so does to size of the dots. You can hover over the data points and get the maximum wind and date at the point in time.

We can create even more complex and colorful graphs that can really grab anyone’s attention. We plotted one track, but why don’t we try to graph the tracks of all major hurricanes that formed in September?

p2 <- plot_mapbox(mode = 'scattermapbox') %>%
  add_markers(
    data = hur[format.Date(hur$Date, "%m")=="09" & hur$Maximum.Wind>=96,], x = ~Longitude, y = ~Latitude, text=~paste('Name: ', Name, '<br>Max Wind:', Maximum.Wind, '<br>Date: ', Date), color=~ID,
    size = ~Maximum.Wind, alpha = 0.5) %>%
  layout(
    plot_bgcolor = '#191A1A', paper_bgcolor = '#191A1A',
    mapbox = list(style = 'dark',
                  zoom = 3,
                  center = list(lat = median(hur$Latitude),
                                lon = median(hur$Longitude))),
    margin = list(l = 0, r = 0,
                  b = 0, t = 0,
                  pad = 0),
    showlegend=FALSE)
p2

Again, we used mapbox and plotly in conjunction to specify hurricane winds that were only above 96 knots (equal to over 110 MPH) and only in the month of September. We also colored each track by their specific ID so we can distinguish tracks better.

We can also create histograms:

(p3 <- plot_ly(alpha = 0.6) %>%
  add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="09" & hur$Maximum.Wind>0], name = 'September', autobinx=TRUE) %>%
  add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="06" & hur$Maximum.Wind>0], name = 'June') %>%
  add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="07" & hur$Maximum.Wind>0], name = 'July') %>%
  add_histogram(x = hur$Maximum.Wind[format.Date(hur$Date, "%m")=="08" & hur$Maximum.Wind>0], name = 'August') %>%
  layout(barmode = "stack"))

Heat Map Visualizations

Let’s look at frequency of hurricanes by month in a 2D Density heat map.

##Creating a heatmap of origin of all hurricane data
atlantic_map<-get_map(location = "puerto rico", zoom = 4)

hur.aug.freq<-hur[format.Date(hur$Date, "%m")=="08",]
hur.sep.freq<-hur[format.Date(hur$Date, "%m")=="09",]
hur.oct.freq<-hur[format.Date(hur$Date, "%m")=="10",]
hur.nov.freq<-hur[format.Date(hur$Date, "%m")=="11",]

par(mfrow = c(2,2))

#second, plotting density maps across the
ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.aug.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.aug.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)

ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.sep.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.sep.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)

ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.oct.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.oct.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)

ggmap(atlantic_map, extent = "device") + geom_density2d(data = hur.nov.freq, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = hur.nov.freq, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.5), guide = FALSE)

We see some interesting relationships between the peak months of August - November. The second week of Semptember is considered the climatogical peak of the season, and as we can seethe heatmap from that month is practically all over the place and covering most of the Atlantic ocean and southern coasts.

It’s even more interesting to see how close August and September tracks seem to be, with the difference seemingly being the frequency. Comparing this to October and November where there is more of a concentration with Carribean tracks. I’m sure there is more of an atmospheric and meteorogical explanation for this.

That looked fantastic, but can we see where the maximum winds were located for all systems?

hur_new<-hur[hur$Maximum.Wind>0 & hur$Minimum.Pressure>0,] #Getting rid of negative values

max_winds<-data.frame(hur_new) %>%
  group_by(ID, Minimum.Pressure, Latitude, Longitude) %>%
  dplyr::summarise(Maximum.Wind = max(Maximum.Wind))

max_winds.major<-max_winds[max_winds$Maximum.Wind>96,]

par(mfrow = c(1,1))

For all systems and where they got their peak strength:

ggmap(atlantic_map, extent = "device") + geom_density2d(data = max_winds, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = max_winds, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)

For all major hurricanes [>96 kt (110 mph)] and where they accumlated their peak strength:

ggmap(atlantic_map, extent = "device") + geom_density2d(data = max_winds.major, aes(x = Longitude, y = Latitude), size = 0.3) + stat_density2d(data = max_winds.major, aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 0.01,bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red") + scale_alpha(range = c(0, 0.3), guide = FALSE)

It’s interesting to see that for all systems (hurricanes, tropical storms, tropical depression), the peak strength was along the east coast. But, when we only account for major hurricanes, we see a long strip through the upper carribean and Bahamas where storms accumulated their peak strength. It’s interesting to also see that both Category 5 hurricanes from 2017 (Irma and Maria) from this year also accumulated their maximum wind along this strip.

Minimum Pressure and Maximum Wind Relationship

Here, we look at Maximum Wind vs. Minimum Pressure:

hur_new<-hur[hur$Maximum.Wind>0 & hur$Minimum.Pressure>0,] #Getting rid of negative values

max_winds2<-data.frame(hur_new) %>%
  group_by(ID, Minimum.Pressure) %>%
  dplyr::summarise(Maximum.Wind = max(Maximum.Wind))

m <- loess(Maximum.Wind ~ Minimum.Pressure, data = max_winds2)

(min_p <- plot_ly(max_winds2, x = ~Minimum.Pressure, color = 'rgb(255, 0, 0)') %>%
  add_markers(y = ~Maximum.Wind, text = rownames(max_winds2), showlegend = FALSE, opacity = 0.5) %>%
  add_lines(y = ~fitted(loess(Maximum.Wind ~ Minimum.Pressure)),
            line = list(color = 'rgba(7, 164, 181, 1)'),
            name = "Loess Smoother") %>%
  add_ribbons(data = augment(m),
              ymin = ~.fitted - 1.96 * .se.fit,
              ymax = ~.fitted + 1.96 * .se.fit,
              line = list(color = 'rgba(7, 164, 181, 0.05)'),
              fillcolor = 'rgba(7, 164, 181, 0.2)',
              name = "Standard Error") %>%
  layout(xaxis = list(title = 'Minimum Pressure (mb)'),
         yaxis = list(title = 'Maximum Winds (kt)'),
         legend = list(x = 0.80, y = 0.90)))
summary(lm(Maximum.Wind ~ Minimum.Pressure, data = max_winds2))
## 
## Call:
## lm(formula = Maximum.Wind ~ Minimum.Pressure, data = max_winds2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.051  -5.927   0.282   6.443  54.002 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.301e+03  4.937e+00   263.5   <2e-16 ***
## Minimum.Pressure -1.256e+00  4.988e-03  -251.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 10550 degrees of freedom
## Multiple R-squared:  0.8573, Adjusted R-squared:  0.8573 
## F-statistic: 6.34e+04 on 1 and 10550 DF,  p-value: < 2.2e-16

Thus, we can see that the maximum winds of a tropical system are highly correlated with a lower minimum pressure. In fact, for every 1.25 mb decrease in minimum pressure, we increase by 1 kt (1.15 mph) in maximum wind for a given time.

R Shiny Web Application

We created an R Shiny Application that would help us visualize different systems and answer some questions we had about the data. The current version of the application has three visualizations that give us insight about different types of tropical systems based on origin, closest tracks, and individual Latitude/Longitude points grouped by windspeed.

  1. \(\textbf{Origin}\): The first interactive visualization is the first one you see when you open the application. You can search any hurricane, tropical storm, or depresssion by both its year and name and find the first ten other systems which started within +/- 1 Latitude/Longitude of the system you are looking at and their specific tracks. Essentially, we are creating a boundary box specifically for one point: the origin of the system you choose. This would be useful when trying to predict where a new system in the future might go. For example, we might find that a system in 2018 started in the same boundary box where Hurricane Andrew originated from. Where do we think this system can go, based on previous observations?
  1. \(\textbf{Closest Tracks}\): This does something similar to the origin visualization, but instead of just creating a boundary box on the origin, lets create a boundary box on the \(\textbf{entire}\) specified track, find all the points that are contained in that box, and count in descending order which tracks had the most points in the boundary box. This was the most simple way to find the closest tracks for a given system. We could have used a more mathematical approach by reducing sums of squares by distance (or other methods), but this was the most simplest algorithm to do what we wanted. This tab also gives you a table of those specific tracks that were most similar to the one you chose.

  2. \(\textbf{Path Finder}\): This is a more useful function and interface one could use for future systems. You could pick a latitude, longitude, and a current wind speed and it would create another boundary box within the +/1 latitude and longitude within that specific point, and any points that had winds that were +/- 5 kt. Then it lists the first 5 tracks in descending order by how close a system is close to the wind specified.